Abstract: This paper considers the following stochastic control
problem that arises in opportunistic spectrum access: a
system consists of n channels where the state ("good" or 
"bad") of each channel evolves as independent and identically 
distributed Markov processes. A user can select exactly k 
channels to sense and access (based on the sensing result) in 
each time slot. A reward is obtained whenever the user senses 
and accesses a "good" channel. The objective is to design a 
channel selection policy that maximizes the expected discounted 
total reward accrued over a finite or infinite horizon. In our 
previous work we established the optimality of a greedy policy 
for the special case of k = 1 (i.e., single channel access) under
the condition that the channel state transitions are positively 
correlated over time. In this paper we show under the same 
condition the greedy policy is optimal for the general case of 
k > 1; the methodology introduced here is thus more general. 
This problem may be viewed as a special case of the restless 
bandit problem, with multiple plays. We discuss connections 
between the current problem and existing literature on this 
class of problems. 

I. Introduction 

We consider the following stochastic control problem: 
there are n uncontrolled Markov chains, each an indepen- 
dent, identically-distributed, two-state discrete-time Markov 
process. The two states will be denoted as state 1 and state 0,
and the transition probabilities are given by $p_{ij}$, $i, j = 0, 1$.

The system evolves in discrete time. In each time instance, 
a user selects exactly k out of the n processes and is allowed 
to observe their states. For each selected process that happens 
to be in state 1 the user gets a reward; there is no penalty for 
selecting a channel that turns out to be in state 0, but each such
occurrence represents a lost opportunity because the user is 
limited to selecting only k of them. The ones that the user 
does not select do not reveal their true states. Our objective is
to derive a selection strategy whose total expected discounted
reward over a finite or infinite horizon is maximized.

This is a Markov decision process (or MDP) problem [?]. 
Furthermore, it is a partially observed MDP (or POMDP) 
problem [?] due to the fact that the states of the underlying 
Markov processes are not fully observed at all times and that 
as a consequence the system state as perceived by the user 
is in the form of a probability distribution, also commonly 
referred to as the information state of the system [?]. This 
problem is also an instance of the restless bandit problem 
with multiple plays [?], [?], [?]. More discussion on this 
literature is provided in Section V.

The application of the above problem abstraction to
multichannel opportunistic access is as follows. Each Markov
process represents a wireless channel, whose state transitions 
reflect dynamic changes in channel conditions caused by 
fading, interference, and so on. Specifically, we will consider 
state 1 as the "good" state, in which a user (or transmitter) 
can successfully communicate with a receiver; state 0 is the
"bad" state, in which communication will fail. The channel 
state is assumed to remain constant within a single discrete 
time step. A multichannel system consists of n distinct 
channels. A user who wishes to use a particular channel at 
the beginning of a time step must first sense or probe the state 
of the channel, and can only transmit in a channel probed 
to be in the "good" state in the same time step. The user 
cannot sense and access more than k channels at a time due 
to hardware limitations. If all k selected channels turn out to 
be in the "bad" state, the user has to wait till the beginning 
of the next time step to repeat the selection process. 

This model captures some of the essential features of 
multichannel opportunistic access as outlined above. On the 
other hand, it has the following limitations: the simplicity 
of the iid two-state channel model; the implicit assumption 
that channel sensing is perfect and the lack of penalty if the 
user transmits in a bad channel due to imperfect sensing; 
and the assumption that the user can select an arbitrary 
set of k channels out of n (e.g., it may only be able to 
access a contiguous block of channels due to physical layer 
limitations). Nevertheless this model does allow us to obtain 
analytical insights into the problem, and more importantly, 
some insight into the more general problem of restless 
bandits with multiple plays. 

This model has been used and studied quite extensively 
in the past few years, mostly within the context of oppor- 
tunistic spectrum access and cognitive radio networks, see 
for example [?], [?], [?], [?]. [?] studied the same problem 
and proved the optimality of the greedy policy in the special 
case of k = 1, n = 2; [?] proved the optimality of the
greedy policy in the case of k = n - 1, while [?], [?]
looked for provably good approximation algorithms for a 
similar problem. Furthermore, the indexability (in the context 
of Whittle's heuristic index and indexability definition [?]) 
of the underlying problem was studied in [?], [?]. 

Our previous work [?] established the optimality of the 
greedy policy for the special case of k = 1 for arbitrary n 
and under the condition $p_{11} \ge p_{01}$, i.e., when a channel's
state transitions are positively correlated. In this sense, the
results reported in the present paper are a direct generalization
of results in [?], as we shall prove the optimality of the 
greedy policy under the same condition but for any n > 
k > 1. The main thought process used to prove this more 
general result derives from that used in [?]. However, there 
were considerable technical difficulties we had to overcome 
to reach the conclusion. 

In the remainder of this paper we first formulate the problem
in Section II, present preliminaries in Section III, and
then prove the optimality of the greedy policy in Section IV.
We discuss our work within the context of restless bandit
problems in Section V. Section VI concludes the paper.

II. Problem Formulation 

As outlined in the introduction, we consider a user trying
to access the wireless spectrum pre-divided into $n$ independent
and statistically identical channels, each given by a two-state
Markov chain. The collection of $n$ channels is denoted
by $\mathcal{N}$, each indexed by $i = 1, 2, \ldots, n$.

The system operates in discrete time steps indexed by $t$,
$t = 1, 2, \ldots, T$, where $T$ is the time horizon of interest. At
time $t^-$, the channels go through state transitions, and at time
$t$ the user makes the channel selection decision. Specifically,
at time $t$ the user selects $k$ of the $n$ channels to sense, the
set denoted by $a^k \subset \mathcal{N}$.

For channels sensed to be in the "good" state (state 1), 
the user transmits in those channels and collects one unit of 
reward for each such channel. If none is sensed good, the user 
does not transmit, collects no reward, and waits until t + 1 to 
make another choice. This process repeats sequentially until 
the time horizon expires. 

The underlying system (i.e., the $n$ channels) is not fully
observable to the user. Specifically, channels go through
state transitions at time $t^-$ (or any time in the interval $(t-1, t)$),
thus when the user makes the channel sensing decision at
time $t$, it does not have the true state of any channel at
time $t$. Furthermore, upon its action (at time $t^+$) only $k$
channels reveal their true states. The user's action space
at time $t$ is given by the finite set $a^k(t) \subset \mathcal{N}$, where
$a^k(t) = \{i_1, \ldots, i_k\}$.

We know (see e.g., [?], [?], [?]) that a sufficient statistic
of such a system for optimal decision making, or the
information state of the system [?], [?], is given by the
conditional probabilities of the state each channel is in given
all past actions and observations. Since each channel can
be in one of two states, we denote this information state
by $\bar{\omega}(t) = [\omega_1(t), \ldots, \omega_n(t)] \in [0,1]^n$, where $\omega_i(t)$ is the
conditional probability that channel $i$ is in state 1 at time $t$
given all past states, actions and observations (see footnote 1).
Throughout the paper $\omega_i(t)$ will be referred to as the information
state of channel $i$ at time $t$, or simply the channel probability of
$i$ at time $t$.

Due to the Markovian nature of the channel model, the 
future information state is only a function of the current 
information state and the current action; i.e., it is independent 
of past history given the current information state and action. 

1 Note that using the information state is a standard way of turning a
POMDP problem into a classic MDP problem, the main implication being
that the state space is now uncountable.



It follows that the information state of the system evolves as
follows. Given that the state at time $t$ is $\bar{\omega}(t)$ and action $a^k(t)$
is taken, $\omega_i(t+1)$ for $i \in a^k(t)$ can take on two values: (1)
$p_{11}$ if the observation is that channel $i$ is in a "good" state;
this occurs with probability $\omega_i(t)$; (2) $p_{01}$ if the observation is
that channel $i$ is in a "bad" state; this occurs with probability
$1 - \omega_i(t)$. For any other channel $j \notin a^k(t)$, with probability 1
the corresponding $\omega_j(t+1) = \tau(\omega_j(t))$, where the operator
$\tau : [0,1] \to [0,1]$ is defined as

$$\tau(\omega) := \omega p_{11} + (1 - \omega) p_{01}, \quad 0 \le \omega \le 1. \tag{1}$$
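
The update rule above can be sketched in a few lines of Python. This is only an illustration under assumed names: `tau`, `update_beliefs`, and the dictionary-based `observations` argument are hypothetical and do not come from the paper.

```python
def tau(omega, p11, p01):
    """Belief update for a channel that was NOT sensed, as in Eq. (1)."""
    return omega * p11 + (1.0 - omega) * p01


def update_beliefs(omega, sensed, observations, p11, p01):
    """Return the next information state.

    omega        : list of probabilities that each channel is in state 1
    sensed       : set of indices of the k sensed channels (the action)
    observations : dict mapping each sensed index to its observed state (0 or 1)
    """
    nxt = []
    for i, w in enumerate(omega):
        if i in sensed:
            # A sensed channel reveals its state, so its belief resets.
            nxt.append(p11 if observations[i] == 1 else p01)
        else:
            # An unsensed channel evolves through the operator tau.
            nxt.append(tau(w, p11, p01))
    return nxt
```

A sensed channel always lands on $p_{11}$ or $p_{01}$, while an unsensed one stays between the two, which is the property the later ordering argument relies on.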

The objective is to maximize the user's total discounted expected
reward over a finite horizon, as given in the following problem
(P) (extension to the infinite horizon is discussed in Section V):

$$(P): \quad \max_{\pi} J_T^{\pi}(\bar{\omega}) = \max_{\pi} E^{\pi}\Big[ \sum_{t=1}^{T} \beta^{t-1} R_{\pi_t}(\bar{\omega}(t)) \,\Big|\, \bar{\omega}(1) = \bar{\omega} \Big]$$

where $0 \le \beta \le 1$ is the discount factor, and $R_{\pi_t}(\bar{\omega}(t))$ is
the reward collected under state $\bar{\omega}(t)$ when channels in the
set $a^k(t) = \pi_t(\bar{\omega}(t))$ are selected.

The maximization in (P) is over the class of deterministic
Markov policies (see footnote 2). An admissible policy $\pi$, given by the
vector $\pi = [\pi_1, \pi_2, \ldots, \pi_T]$, is such that $\pi_t$ specifies a
mapping from the current information state $\bar{\omega}(t)$ to a channel
selection action $a^k(t) = \pi_t(\bar{\omega}(t)) \subset \{1, 2, \ldots, n\}$. This
is done without loss of optimality due to the Markovian
nature of the underlying system, and due to known results
on POMDPs [?, Chapter 6].

III. Preliminaries 

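The dynamic programming (DP) representation of problem
(P) is given as follows:

$$V_T(\bar{\omega}) = \max_{a^k \subset \mathcal{N}, |a^k| = k} E[R_{a^k}(\bar{\omega})]$$

$$V_t(\bar{\omega}) = \max_{a^k \subset \mathcal{N}, |a^k| = k} \Big( \sum_{i \in a^k} \omega_i + \beta \sum_{l_i \in \{0,1\},\, i \in a^k} \; \prod_{i \in a^k} \omega_i^{l_i} (1 - \omega_i)^{1 - l_i} \cdot V_{t+1}(p_{01}, \ldots, p_{01}, \tau(\omega_j), p_{11}, \ldots, p_{11}) \Big), \tag{2}$$

$$t = 1, 2, \ldots, T-1.$$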

In the last term, the channel state probability vector consists
of three parts: a sequence of $p_{01}$'s that represent those
channels sensed to be in state 0 at time $t$, where the length of
this sequence is the number of $l_i$'s equaling zero; a sequence
of values $\tau(\omega_j)$ for all $j \notin a^k$; and a sequence of $p_{11}$'s that
represent those channels sensed to be in state 1 at time $t$, where
the length of this sequence is the number of $l_i$'s equaling
one. Note that the future expected reward is calculated by
summing over all possible realizations of the $k$ selected
channels.
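
For very small instances the recursion in (2) can be evaluated by brute force, which makes the roles of the three terms concrete. The sketch below is an assumption-laden illustration rather than the authors' implementation: the function name `value` and its argument order are hypothetical, and channels keep their original indices instead of being regrouped into the three blocks, which does not change the value because the channels are statistically identical.

```python
from itertools import combinations, product


def tau(w, p11, p01):
    """Belief update of an unsensed channel, Eq. (1)."""
    return w * p11 + (1 - w) * p01


def value(omega, t, T, k, p11, p01, beta):
    """Maximum expected discounted reward from slot t to T (brute force)."""
    n = len(omega)
    best = float("-inf")
    for action in combinations(range(n), k):
        # Immediate expected reward: sum of beliefs of the sensed channels.
        total = sum(omega[i] for i in action)
        if t < T:
            future = 0.0
            # Enumerate every realization (l_i) of the k sensed channels.
            for outcome in product((0, 1), repeat=k):
                prob = 1.0
                nxt = [tau(w, p11, p01) for w in omega]
                for i, l in zip(action, outcome):
                    prob *= omega[i] if l == 1 else 1 - omega[i]
                    nxt[i] = p11 if l == 1 else p01
                if prob > 0:
                    future += prob * value(tuple(nxt), t + 1, T, k, p11, p01, beta)
            total += beta * future
        best = max(best, total)
    return best
```

The cost grows as $\binom{n}{k} 2^k$ per slot and exponentially in the horizon, so this is only usable as a sanity check on toy problems.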

2 A Markov policy is a policy that derives its action only depending on 
the current (information) state, rather than the entire history of states, see 
e.g., [?]. 



The value function $V_t(\bar{\omega})$ represents the maximum expected
future reward that can be accrued starting from time
$t$ when the information state is $\bar{\omega}$. In particular, we have
$V_1(\bar{\omega}) = \max_{\pi} J_T^{\pi}(\bar{\omega})$, and an optimal deterministic Markov
policy exists such that $a = \pi_t^*(\bar{\omega})$ achieves the maximum in
(2) (see e.g., [?] (Chapter 4)).

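For simplicity of representation, we introduce the following
notations:

• $p_{01}[x]$: this is the vector $[p_{01}, p_{01}, \ldots, p_{01}]$ of length $x$;

• $p_{11}[x]$: this is the vector $[p_{11}, p_{11}, \ldots, p_{11}]$ of length $x$.

• We will use the notation:

$$q(l_1, \ldots, l_k) := \prod_{1 \le i \le k} \omega_i^{l_i} (1 - \omega_i)^{1 - l_i}$$

for $l_1, \ldots, l_k \in \{0, 1\}$. That is, given a vector of 0s and
1s (total of $k$ elements), $q(\cdot)$ is the probability that a set
of $k$ channels are in states given by the vector.

With the above notation, Eqn (2) can be written as

$$V_t(\bar{\omega}) = \max_{a^k \subset \mathcal{N}, |a^k| = k} \Big( \sum_{i \in a^k} \omega_i + \beta \sum_{l_i \in \{0,1\},\, i \in a^k} q(l_1, \ldots, l_k) \cdot V_{t+1}\big(p_{01}[k - \textstyle\sum l_i], \ldots, \tau(\omega_j), \ldots, p_{11}[\textstyle\sum l_i]\big) \Big).$$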

Solving (P) using the above recursive equation can be 
computationally heavy, especially considering the fact that 
$\bar{\omega}$ is a vector of probabilities. It is thus common to consider
suboptimal policies that are easier to compute and imple- 
ment. One of the simplest such heuristics is a greedy policy 
where at each time step we take an action that maximizes 
the immediate one-step reward. Our focus is to examine the 
optimality properties of such a simple greedy policy. 

For problem (P), the greedy policy under state $\bar{\omega} =
[\omega_1, \omega_2, \ldots, \omega_n]$ is given by

$$a^k(\bar{\omega}) = \arg\max_{a^k \subset \mathcal{N}, |a^k| = k} \sum_{i \in a^k} \omega_i. \tag{3}$$

That is, the greedy policy seeks to maximize the reward as 
if there were only one step remaining in the horizon. In 
the next section we investigate the optimality of this policy. 
Specifically, we will show that it is optimal in the case of 
$p_{11} \ge p_{01}$. This extends the earlier result in [?] that showed
this to be true for the special case of k = 1. 
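
Because the one-step reward in (3) is a sum of the selected beliefs, the greedy action is simply the set of the $k$ largest entries of the belief vector. A minimal sketch, assuming 0-based channel indices and a hypothetical helper name `greedy_action`:

```python
def greedy_action(omega, k):
    """Return the indices of the k channels with the largest beliefs.

    Ties are broken by channel index here; any tie-breaking rule gives
    the same one-step reward.
    """
    order = sorted(range(len(omega)), key=lambda i: omega[i], reverse=True)
    return sorted(order[:k])


if __name__ == "__main__":
    print(greedy_action([0.2, 0.9, 0.5, 0.7], 2))
```

Computing the greedy action costs one sort per slot, in contrast to the exhaustive search over actions and realizations required by the dynamic program.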

IV. Optimality of the Greedy Policy 

In this section we show that the greedy policy is optimal 
when $p_{11} \ge p_{01}$. The main theorem of this section is as
follows. 

Theorem 1: The greedy policy is optimal for Problem
(P) under the assumption that $p_{11} \ge p_{01}$. That is, for
$t = 1, 2, \ldots, T$, $k \le n$, and $\forall \bar{\omega} = [\omega_1, \ldots, \omega_n] \in [0, 1]^n$,
we have

$$V_t^k(\bar{\omega}; z^k(\bar{\omega})) \ge V_t^k(\bar{\omega}; a^k), \quad \forall a^k \subset \mathcal{N}, \tag{4}$$

where $z^k(\bar{\omega})$ is the subset whose elements (indices) correspond
to the $k$ largest values in $\bar{\omega}$, and $V_t^k(\bar{\omega}; a^k)$ is the
expected value of action $a^k$ followed by behaving optimally.



Below we present a number of lemmas used in the proof 
of this theorem. The first lemma introduces a notation that 
allows us to express the expected future reward under the 
greedy policy. 

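Lemma 1: There exist $T$ $n$-variable functions, denoted by
$W_t^k(\bar{\omega})$, $t = 1, 2, \ldots, T$, each of which is a polynomial of
order 1 (see footnote 3) and can be represented recursively in the
following form:

$$W_T^k(\bar{\omega}) = \sum_{n-k+1 \le i \le n} \omega_i$$

$$W_t^k(\bar{\omega}) = \sum_{n-k+1 \le i \le n} \omega_i + \beta \sum_{l_n, l_{n-1}, \ldots, l_{n-k+1} \in \{0,1\}} q(l_{n-k+1}, \ldots, l_n) \cdot W_{t+1}^k\big(p_{01}[k - \textstyle\sum l_i], \tau(\omega_1), \ldots, \tau(\omega_{n-k}), p_{11}[\textstyle\sum l_i]\big).$$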

The proof is easily obtained using backward induction on 
t given the recursive equation and noting that the mapping 
$\tau(\cdot)$ is linear. The detailed proof is thus omitted for brevity.

A few remarks are in order on this function $W_t^k(\bar{\omega})$.

i) Firstly, when $\bar{\omega}$ is given by an ordered vector
$[\omega_1, \omega_2, \ldots, \omega_n]$ with $\omega_1 \le \omega_2 \le \cdots \le \omega_n$, $W_t^k(\bar{\omega})$ is
the expected total discounted future reward (from $t$ to
$T$) by following the greedy policy.

This follows from how the greedy policy works in
the special case of $p_{11} \ge p_{01}$. Note that in this case
the conditional probability updating function $\tau(\omega)$ is a
monotonically increasing function, i.e., $\tau(\omega_1) \ge \tau(\omega_2)$
for $\omega_1 \ge \omega_2$. Therefore the ordering of channel
probabilities is preserved among those that are not
observed.

If a channel has been observed to be in state "1"
(respectively "0"), its probability at the next step
becomes $p_{11} \ge \tau(\omega)$ (respectively $p_{01} \le \tau(\omega)$) for
any $\omega \in [0,1]$. In other words, a channel observed to
be in state "1" (respectively "0") will have the highest
(respectively lowest) possible probability among all
channels.

Therefore if we take the initial information state $\bar{\omega}(1)$,
order the channels according to their probabilities
$\omega_i(1)$, and sense the highest $k$ channels (top $k$ of the
ordered list) with ties broken randomly, then following
the greedy policy means that in subsequent steps we
will keep a channel in its current position if it was
sensed to be in state 1 in the previous slot; otherwise,
it was observed to be in state 0 and gets thrown to the
bottom of the ordered list. The policy then selects the
next topmost (or rightmost) $k$ channels on this new
ordered list. This procedure is essentially the same as
that given in the recursive expression of $W_t^k(\cdot)$.

ii) Secondly, when $\bar{\omega}$ is not ordered, $W_t^k(\cdot)$ reflects a
policy that simply goes down the list of channels by
the order fixed in $\bar{\omega}$, while each time tossing the ones
observed to be 0 to the end of the list and keeping those
observed to be 1 at the top of the list.

iii) Thirdly, the fact that $W_t^k$ is a polynomial of order 1
and affine in each of its elements implies that

$$W_t^k(\omega_1, \ldots, \omega_{n-2}, y, x) - W_t^k(\omega_1, \ldots, \omega_{n-2}, x, y) = (x - y)\big[ W_t^k(\omega_1, \ldots, \omega_{n-2}, 0, 1) - W_t^k(\omega_1, \ldots, \omega_{n-2}, 1, 0) \big].$$

Similar results hold when we change the positions of $x$
and $y$. To see this, consider the above as two functions
of $x$ and $y$, each having an $x$ term, a $y$ term, an $xy$
term and a constant term. Since we are only swapping
the positions of $x$ and $y$ in these two functions, the
constant term remains the same, and so does the $xy$
term. Thus the only difference is the $x$ term and the
$y$ term, as given in the above equation. This linearity
result is used later in our proof.

3 Each function $W_t^k$ is affine in each variable, when all other variables are
held constant.
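
The recursion of Lemma 1 and the swap identity in remark iii) can be checked numerically on a toy instance. The sketch below is illustrative only: the function name `W`, its argument order, and the parameter values are assumptions, and positions are stored in a Python list whose last $k$ entries are the sensed channels.

```python
from itertools import product


def tau(w, p11, p01):
    """Belief update of an unsensed channel, Eq. (1)."""
    return w * p11 + (1 - w) * p01


def W(omega, t, T, k, p11, p01, beta):
    """Lemma 1 recursion: always sense the last k positions of omega."""
    n = len(omega)
    top = omega[n - k:]
    total = sum(top)
    if t == T:
        return total
    rest = [tau(w, p11, p01) for w in omega[:n - k]]
    future = 0.0
    for outcome in product((0, 1), repeat=k):
        prob = 1.0
        for w, l in zip(top, outcome):
            prob *= w if l == 1 else 1 - w
        ones = sum(outcome)
        # Zeros go to the front, ones to the back, unsensed stay in order.
        nxt = [p01] * (k - ones) + rest + [p11] * ones
        future += prob * W(nxt, t + 1, T, k, p11, p01, beta)
    return total + beta * future


# Hypothetical parameters chosen only for illustration.
args = dict(t=1, T=3, k=1, p11=0.8, p01=0.2, beta=0.9)
prefix, x, y = [0.3, 0.6], 0.9, 0.4
lhs = W(prefix + [y, x], **args) - W(prefix + [x, y], **args)
rhs = (x - y) * (W(prefix + [0.0, 1.0], **args) - W(prefix + [1.0, 0.0], **args))
print("swap identity holds:", abs(lhs - rhs) < 1e-9)
```

The identity holds for any values of the two swapped entries because $W_t^k$ is affine in each argument separately; only the sign of the bracketed difference depends on the channel parameters.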

The next lemma establishes a sufficient condition for the 
optimality of the greedy policy. 

Lemma 2: Consider Problem (P) under the assumption
that $p_{11} \ge p_{01}$. To show that the greedy policy is optimal
at time $t$ given that it is optimal at $t+1, t+2, \ldots, T$, it
suffices to show that at time $t$ we have

$$W_t^k(\omega_1, \ldots, \omega_j, x, y, \ldots, \omega_n) \le W_t^k(\omega_1, \ldots, \omega_j, y, x, \ldots, \omega_n), \tag{5}$$

for all $x \ge y$ and all $0 \le j \le n-2$, with $j = 0$ implying
$W_t^k(x, y, \omega_3, \ldots, \omega_n) \le W_t^k(y, x, \omega_3, \ldots, \omega_n)$.

Proof: Since the greedy policy is optimal from $t+1$
on, it is sufficient to show that selecting the best $k$ channels
followed by the greedy policy is better than selecting any
other set of $k$ channels followed by the greedy policy. If
channels are ordered $\omega_1 \le \cdots \le \omega_n$, then the
reward of the former is precisely given by $W_t^k(\omega_1, \ldots, \omega_n)$.
On the other hand, the reward of selecting an arbitrary set $a^k$
of $k$ channels followed by acting greedily can be expressed
as $W_t^k(\bar{a}^k, a^k)$, where $\bar{a}^k$ is the (increasingly) ordered set of
channels not included in $a^k$. It remains to show that if Eqn
(5) is true then we have $W_t^k(\bar{a}^k, a^k) \le W_t^k(\omega_1, \ldots, \omega_n)$.
This is easily done since the ordered list $(\bar{a}^k, a^k)$ may be
converted to $\omega_1, \ldots, \omega_n$ through a sequence of switchings
between two neighboring elements that are not increasingly
ordered. Each such switch invokes (5), thereby maintaining
the "$\le$" relationship. □
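
The switching argument in the proof is the familiar fact that any list can be sorted using only exchanges of adjacent, out-of-order neighbors. A small sketch of that procedure follows; the function name `adjacent_swaps_to_sort` is hypothetical, and the list entries stand for the channel probabilities in the order $(\bar{a}^k, a^k)$.

```python
def adjacent_swaps_to_sort(values):
    """Sort increasingly using only swaps of out-of-order neighbors.

    Returns the sorted list and the positions j at which entries j and
    j + 1 were exchanged. In the proof of Lemma 2, each such exchange
    moves the larger value to the right and is covered by inequality (5).
    """
    vals = list(values)
    swaps = []
    changed = True
    while changed:
        changed = False
        for j in range(len(vals) - 1):
            if vals[j] > vals[j + 1]:
                vals[j], vals[j + 1] = vals[j + 1], vals[j]
                swaps.append(j)
                changed = True
    return vals, swaps
```

Since every exchange only ever moves a larger probability to the right of a smaller one, the chain of inequalities obtained from (5) all point in the same direction.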

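Lemma 3: For $0 \le \omega_1 \le \omega_2 \le \cdots \le \omega_n \le 1$, we have
the following two inequalities for all $t = 1, 2, \ldots, T$:

(A): $1 + W_t^k(\omega_2, \ldots, \omega_n, \omega_1) \ge W_t^k(\omega_1, \ldots, \omega_n)$

(B): $W_t^k(\omega_1, \ldots, \omega_j, y, x, \omega_{j+3}, \ldots, \omega_n) \ge W_t^k(\omega_1, \ldots, \omega_j, x, y, \omega_{j+3}, \ldots, \omega_n)$,

where $x \ge y$, $0 \le j \le n-2$, and $j = 0$ implies
$W_t^k(y, x, \omega_3, \ldots, \omega_n) \ge W_t^k(x, y, \omega_3, \ldots, \omega_n)$.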



This lemma is the key to our main result and its proof,
which uses a sample path argument, is highly instructive. It is
however also lengthy, and for this reason has been relegated 
to the Appendix. 

With the above lemmas, Theorem 1 is easily proven:
Proof of Theorem 1: We prove by backward induction on $t$. When
$t = T$, the greedy policy is obviously optimal. Suppose
it is also optimal for all times $t+1, t+2, \ldots, T$, under
the assumption $p_{11} \ge p_{01}$. Then at time $t$, by Lemma 2,
it suffices to show that $W_t^k(\omega_1, \ldots, \omega_j, x, y, \ldots, \omega_n) \le
W_t^k(\omega_1, \ldots, \omega_j, y, x, \ldots, \omega_n)$ for all $x \ge y$ and $0 \le j \le
n-2$. But this is proven in Lemma 3. □
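
As a sanity check on Theorem 1, one can compare the value of the greedy policy with the brute-force optimal value on a toy instance. The sketch below is only a numerical illustration, not part of the proof; all names and the parameter values (`P11`, `P01`, `BETA`, `K`, `T`, `omega0`) are hypothetical choices satisfying $p_{11} \ge p_{01}$.

```python
from itertools import combinations, product

# Hypothetical toy parameters with P11 >= P01 (positive correlation).
P11, P01, BETA, K, T = 0.8, 0.2, 0.9, 2, 3


def tau(w):
    """Belief update of an unsensed channel, Eq. (1)."""
    return w * P11 + (1 - w) * P01


def greedy(omega):
    """Indices of the K largest beliefs."""
    return tuple(sorted(sorted(range(len(omega)), key=lambda i: -omega[i])[:K]))


def action_value(omega, action, t, follow_greedy):
    """Expected reward of `action` at slot t, then greedy or optimal play."""
    total = sum(omega[i] for i in action)
    if t == T:
        return total
    future = 0.0
    for outcome in product((0, 1), repeat=K):
        prob = 1.0
        nxt = [tau(w) for w in omega]
        for i, l in zip(action, outcome):
            prob *= omega[i] if l else 1 - omega[i]
            nxt[i] = P11 if l else P01
        future += prob * value(tuple(nxt), t + 1, follow_greedy)
    return total + BETA * future


def value(omega, t, follow_greedy):
    """Value of the greedy policy, or the optimal value by exhaustive search."""
    if follow_greedy:
        return action_value(omega, greedy(omega), t, True)
    return max(action_value(omega, a, t, False)
               for a in combinations(range(len(omega)), K))


omega0 = (0.2, 0.5, 0.9)
gap = value(omega0, 1, False) - value(omega0, 1, True)
print(f"optimal minus greedy value: {gap:.2e}")
```

Under the theorem the reported gap should vanish up to floating-point rounding; choosing `P11 < P01` instead takes the instance outside the assumption covered by the result.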

V. Discussion 

While the formulation (P) is a finite horizon problem, 
the same result applies to the infinite horizon discounted 
reward case using standard techniques as we have done in 
our previous work [?], [?]. 

In the case of infinite horizon, the problem studied in this 
paper is closely associated with the class of multi-armed 
bandit problems [?] and restless bandit problems [?]. This is 
a class of problems where n controlled Markov chains (also 
called machines or arms) are activated (or played) one at a 
time. A machine when activated generates a state dependent 
reward and moves to the next state according to a Markov 
rule. A machine not activated either stays frozen in its current 
state (a rested bandit) or moves to the next state according 
to a possibly different Markov rule (a restless bandit). The 
problem is to decide the sequence in which these machines 
are activated so as to maximize the expected (discounted or 
average) reward over an infinite horizon. 

The multi-armed bandit problem was originally solved 
by Gittins (see [?]), who showed that there exists an index 
associated with each machine that is solely a function of that 
individual machine and its state, and that playing the machine 
currently with the highest index is optimal. This index has 
since been referred to as the Gittins index. The remarkable 
nature of this result lies in the fact that it decomposes the 
n-dimensional problem into n one-dimensional problems, as
an index is defined for a machine independent of others. 
The restless bandit problem, on the other hand, has proven
much more complex, and is PSPACE-hard in general [?].
Relatively little is known about the structure of its optimal 
policy in general. In particular, the Gittins index policy is 
not in general optimal [?]. 

When multiple machines are activated simultaneously, the 
resulting problem is referred to as multi-armed bandits with 
multiple plays. Again optimal solutions to this class of 
problems are not known in general. A natural extension to 
the Gittins index policy in this case is to play the machines 
with the highest Gittins indices (this will be referred to as the 
extended Gittins index policy below). This is not in general 
optimal for multi-armed bandits with multiple plays and an 
infinite horizon discounted reward criterion, see e.g., [?], 
[?]. However, it may be optimal in some cases, see e.g., 
[?] for conditions on the reward function, and [?] for an
undiscounted case where the Gittins index is always achieved
at time 1. Even less is known when the bandits are restless, 
though asymptotic results for restless bandits with multiple 
plays were provided in [?] and [?]. 

The problem studied in the present paper is an instance 
of the restless bandits with multiple plays (in the infinite 
horizon case). Therefore what we have shown in this paper 
is an instance of the restless bandits problem with multiple 
plays, for which the extended Gittins index policy is optimal. 

VI. Conclusion 

In this paper we studied a stochastic control problem that 
arose in opportunistic spectrum access. A user can sense 
and access k out of n channels at a time and must select 
judiciously in order to maximize its reward. We extend a 
previous result where a greedy policy was shown to be 
optimal in the special case of k = 1 under the condition that 
the channel state transitions are positively correlated over 
time. In this paper we showed that under the same condition 
the greedy policy is optimal for the general case of k > 1. 
This result also contributes to the understanding of the class 
of restless bandit problems with multiple plays. 

Appendix 

Proof of Lemma 3: We would like to show

(A): $1 + W_t^k(\omega_2, \ldots, \omega_n, \omega_1) \ge W_t^k(\omega_1, \ldots, \omega_n)$

(B): $W_t^k(\omega_1, \ldots, \omega_j, y, x, \omega_{j+3}, \ldots, \omega_n) \ge W_t^k(\omega_1, \ldots, \omega_j, x, y, \omega_{j+3}, \ldots, \omega_n)$,

where $x \ge y$, $0 \le j \le n-2$, and $j = 0$ implies
$W_t^k(y, x, \omega_3, \ldots, \omega_n) \ge W_t^k(x, y, \omega_3, \ldots, \omega_n)$.

The two inequalities (A) and (B) will be shown together
using an induction on $t$. For $t = T$, part (A) is true because

$$LHS = 1 + \omega_1 + \sum_{i=n-k+2}^{n} \omega_i \ge \omega_{n-k+1} + \sum_{i=n-k+2}^{n} \omega_i = RHS.$$

Part (B) is obviously true for $t = T$ since $x \ge y$.

Suppose (A) and (B) are both true for $t+1, \ldots, T$.
Consider time $t$, and we will prove (A) first. Note that in
the next step, channel 1 is selected by the action on the LHS
of (A) but not by the RHS, while channel $n-k+1$ is
selected by the RHS of (A) but not by the LHS. Other than
this difference both sides select the same set of channels
indexed $n-k+2, \ldots, n$. We now consider four possible
cases in terms of the realizations of channels 1 and $n-k+1$.

Case (A.1): channels 1 and $n-k+1$ have the state
realizations "0" and "1", respectively.

We will use a sample-path argument. Note that while
these two channels are not both observed by either side, the
realizations hold for the underlying sample path regardless.
In particular, even though the LHS does not select channel
$n-k+1$ and therefore does not get to actually observe the
realization of "1", the fact remains that channel $n-k+1$
is indeed in state 1 under this realization, and therefore its
future expected reward must reflect this. It follows that under
this realization channel $n-k+1$ will have probability $p_{11}$
for the next time step even though we did not get to observe
the state 1. The same is true for the RHS. This argument
applies to the other three cases and is thus not repeated.

Conditioned on this realization, the LHS and RHS
are evaluated as follows (denoted as $\{LHS|_{(0,1)}\}$ and
$\{RHS|_{(0,1)}\}$, respectively):

$$\{LHS|_{(0,1)}\} = 1 + \sum_{n-k+2 \le i \le n} \omega_i + \beta \sum_{l_{n-k+2}, \ldots, l_n \in \{0,1\}} q(l_{n-k+2}, \ldots, l_n) \cdot W_{t+1}^k\big(p_{01}[k - \textstyle\sum l_i], \tau(\omega_2), \ldots, \tau(\omega_{n-k}), \tau(\omega_{n-k+1}) = p_{11}, p_{11}[\textstyle\sum l_i]\big)$$

$$\{RHS|_{(0,1)}\} = 1 + \sum_{n-k+2 \le i \le n} \omega_i + \beta \sum_{l_{n-k+2}, \ldots, l_n \in \{0,1\}} q(l_{n-k+2}, \ldots, l_n) \cdot W_{t+1}^k\big(p_{01}[k - \textstyle\sum l_i - 1], \tau(\omega_1) = p_{01}, \tau(\omega_2), \ldots, \tau(\omega_{n-k}), p_{11}[\textstyle\sum l_i + 1]\big) = \{LHS|_{(0,1)}\}$$

Case (A.2): channels 1 and n — 1 + 1 have the state 
realizations "1" and "1", respectively. 

= 1 + 1+ Ui + /3- 

n—k+2<i<n 

^ q(ln-k+2, ■ ■ ■ , In) ■ 

ln-k+2,--- A€{0,1} 

W t k +1 (p 01 [k-Yh-l],r(w2),--- , 

T(w n - k +l) =Pll,Pll[Y l i + 1 }) 5 

{RHS\ (hl) } 

= 1+ J2 "i+p- 

n—k+2<i<n 

q(ln-k+2, ■ ■ • , In) 1 

in-fc+2,-Ae{0,l} 

W t k +1 {p 01 [k-Y l i - 1],t(wi) =pn, 
t(u 2 ), ■ ■ ■ ,r(u) n - k ),Pii[Y l i + !]) 

< 1+ 

n—k+2<i<n 

q(ln-k+2, ■ ■ ' , In) • 

ln-k+2,- ,i n e{o,i] 

Wf +1 (p i[k ~Y li ~ i]' 7 "^), • • • ,r(w n _ fe ), 
= {LHS\ (ltl) }-l<{LHS\ (1>1) } 



where the first inequality is due to the induction hypothesis realizations "1" and "0", respectively. 
of(B). 

Case (A. 3): channels 1 and n — 1 + 1 have the state 
realizations "0" and "0", respectively. 



{RHS\ (0 . 0) } 

n—k+2<i<n 

^ q(ln-k+2, ■ ■ ■ , In) ■ 

ln-k+2,--- ,z„e{o,i} 

Wf +1 {p i[k - ^/i],r(wi) =poi,r(w 2 ),--- ,r(w n -k), 



{Xi?5|( 0;0 )} 

= 1+ "i+p- 

n—k+2<i<n 

^ q(ln-k+2, ■ ■ ■ , in) ■ 

ln-k+2,--- ,*„e{o,i} 

W t fe +1 (poi[fc - £ii],r(w 2 ), • • • ,r(w„_ fe ), 
T(w„_ fc+ i) = Poi,Pn[£ii]) 

> 1 + £ Wi + /3- 

n— k+2<i<n 

^ q(ln-k+2, ■ ■ ■ ,ln) ■ 

in-M-2,- Ae{o,i} 

T^t+iboilfc - y^^]: T (^2), • • • ,r(w n _ fe ), 

> ^ Wj + /3 • 

n— &+2<2<n 

^ q{ln-k+2, ■■ ■ ,ln) ■ 

ln-k+2,--- ,'ne{0,l} 

(i + w^ +1 (p i[fe-X!y. T M."- > r ( 

> ^ wj + /3 • 

n— k+2<i<n 

^ q(ln-k+2, ■ ■ ■ ,ln) ■ 

ln-k+2,- ,Ue{0A} 

Wf +1 (p i,Poi[k - £i;],T(u; 2 ), • • • ,r(w„_ fe ), 

= {ilffS| (0 ,o)} 

where the first inequality is due to the induction hypothesis of 
(B), the last inequality due to the induction hypothesis of (A). 
Also, the second inequality utilizes the total probability over 
the distribution q(l n - k +2, ■■■ ,l n ) and the fact that (3 < 1. where the first and last inequalities are due to the induction 
Case (A.4): channels 1 and n — 1 + 1 have the state hypothesis of (B), the third due to the induction hypothesis 



{RHS\ {1 . 0) } 

£ Wi + p ■ 

n—k+2<i<n 

^ q{ln-k+2, ■ ■ ■ ,ln) ■ 

ln-k+2,- ,l n £{0,l} 

Wf +1 (p i[k - £y,r(wi) =pn,r(w 2 ), ■ •■ , 
r{u n -k),Pii\y] k]) 

{LHS\ (m } 
= 1 + 1+ £ u>i + 0- 

n— fc+2<z<n 
£ q{ln-k+2, ■ ■ ■ , in) ■ 

in-k+2, ■■■ ,i„e{o,i} 

W t +i(Poi[* - £ii - 1],t(w 2 ), • • • , r(w n _ fe ), 
T(w„_ fc+ i) =poi,pn[y^ h + 1]) 

> 1 + 1+ £ u>i + 0- 

n—k+2<i<n 
£ q(ln-k+2, ■ ■ ■ ,ln) ■ 

ln-k+2,- ,i n e{o,i} 

W t k +1 (p i[k - - l],r(w 2 ), • • • ,T(w n _ fc ), 

Pn[£ ii + l],Poi) 

> 1+ £ Wi + /3- 

n— &+2<2<n 

£ q(ln-k+2, ■ ■ ■ , in) ' 

«„-fc+2,-,«ne{0,l} 

(l + W t fc +1 (poi[fc - £ Z 4 - 1],tM, • • • , 
r(w n _ fe ),pii[£ ij + l],Poi)) 

> 1+ £ 

n— k+2<i<n 

^ q{ln-k+2, ■ ■ ■ , in) ■ 

«n-*+2,- ,i„£{0,l} 

W t fc +1 (p i[fc - £ii] ; T ( w 2), • • • ,r(w„_ fe ), 

> 1+ £ + 

n— k+2<i<n 

£ q{ln-k+2, ■ ■ ■ , in) ' 

in-fc+2," ,J„e{o,i} 

W"t+i(Poi[fc - y^y.Pii- 7 "^), • • • ,r(o; n _fc), 
= l + {ilffS| (li0) }>{ilff5|(i,o)} 



of (A). 

With these four cases, we conclude the induction step of 
proving (A). We next prove the induction step of (B). We 
consider three cases in terms of whether x and y are among 
the top k channels to be selected in the next step. 

Case (B.l): both x and y belong to the top k positions on 
both sides. In this case there is no difference between the 
LHS and RHS along each sample path, since both channels 
will be selected and the result will be the same. 

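Case (B.2): neither $x$ nor $y$ is among the top $k$ positions
on either side. This implies that $j \le n-k-2$. We have:

$$LHS = \sum_{n-k+1 \le i \le n} \omega_i + \beta \sum_{l_{n-k+1}, \ldots, l_n \in \{0,1\}} q(l_{n-k+1}, \ldots, l_n) \cdot W_{t+1}^k\big(p_{01}[k - \textstyle\sum l_i], \tau(\omega_1), \ldots, \tau(\omega_j), \tau(y), \tau(x), \tau(\omega_{j+3}), \ldots, p_{11}[\textstyle\sum l_i]\big)$$

$$\begin{aligned}
RHS &= \sum_{n-k+1 \le i \le n} \omega_i + \beta \sum_{l_{n-k+1}, \ldots, l_n \in \{0,1\}} q(l_{n-k+1}, \ldots, l_n) \cdot W_{t+1}^k\big(p_{01}[k - \textstyle\sum l_i], \tau(\omega_1), \ldots, \tau(\omega_j), \tau(x), \tau(y), \tau(\omega_{j+3}), \ldots, p_{11}[\textstyle\sum l_i]\big) \\
&\le LHS
\end{aligned}$$

where the last inequality is due to the monotonicity of $\tau(\cdot)$
and the induction hypothesis of (B).

Case (B.3): exactly one of the two belongs to the top
$k$ channels on each side. This implies that $j = n-k-1$.
By the linearity of the function we have the following:

$$W_t^k(\omega_1, \ldots, \omega_{n-k-1}, y, x, \omega_{n-k+2}, \ldots, \omega_n) - W_t^k(\omega_1, \ldots, \omega_{n-k-1}, x, y, \omega_{n-k+2}, \ldots, \omega_n) = (x - y)\big( W_t^k(\omega_1, \ldots, \omega_{n-k-1}, 0, 1, \omega_{n-k+2}, \ldots, \omega_n) - W_t^k(\omega_1, \ldots, \omega_{n-k-1}, 1, 0, \omega_{n-k+2}, \ldots, \omega_n) \big) \tag{6}$$

However, we have

$$\begin{aligned}
W_t^k(\omega_1, \ldots, \omega_{n-k-1}, 1, 0, \omega_{n-k+2}, \ldots, \omega_n) &= \sum_{n-k+2 \le i \le n} \omega_i + \beta \sum_{l_{n-k+2}, \ldots, l_n \in \{0,1\}} q(l_{n-k+2}, \ldots, l_n) \cdot W_{t+1}^k\big(p_{01}[k - \textstyle\sum l_i], \tau(\omega_1), \ldots, \tau(\omega_{n-k-1}), p_{11}[\textstyle\sum l_i + 1]\big) \\
&\le \sum_{n-k+2 \le i \le n} \omega_i + \beta \sum_{l_{n-k+2}, \ldots, l_n \in \{0,1\}} q(l_{n-k+2}, \ldots, l_n) \cdot \Big(1 + W_{t+1}^k\big(p_{01}[k - \textstyle\sum l_i - 1], \tau(\omega_1), \ldots, \tau(\omega_{n-k-1}), p_{11}[\textstyle\sum l_i + 1], p_{01}\big)\Big) \\
&\le \sum_{n-k+2 \le i \le n} \omega_i + \beta \sum_{l_{n-k+2}, \ldots, l_n \in \{0,1\}} q(l_{n-k+2}, \ldots, l_n) \cdot \Big(1 + W_{t+1}^k\big(p_{01}[k - \textstyle\sum l_i - 1], \tau(\omega_1), \ldots, \tau(\omega_{n-k-1}), p_{01}, p_{11}[\textstyle\sum l_i + 1]\big)\Big) \\
&\le 1 + \sum_{n-k+2 \le i \le n} \omega_i + \beta \sum_{l_{n-k+2}, \ldots, l_n \in \{0,1\}} q(l_{n-k+2}, \ldots, l_n) \cdot W_{t+1}^k\big(p_{01}[k - \textstyle\sum l_i - 1], \tau(\omega_1), \ldots, \tau(\omega_{n-k-1}), p_{01}, p_{11}[\textstyle\sum l_i + 1]\big) \\
&= W_t^k(\omega_1, \ldots, \omega_{n-k-1}, 0, 1, \omega_{n-k+2}, \ldots, \omega_n)
\end{aligned}$$

Since $x \ge y$, we have $LHS \ge RHS$ in Eqn (6). This
concludes the induction step of (B).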



